THE DISENTANGLING NUMBER FOR PHYLOGENETIC MIXTURES 

SETH SULLIVANT 

Abstract. We provide a logarithmic upper bound for the disentangling number on 
unordered lists of leaf labeled trees. This result is useful for analyzing phylogenetic 
mixture models. The proof depends on interpreting multisets of trees as high dimensional 
contingency tables. 

For a set $X$ of leaf labels let $\mathcal{T}_X$ be the set of trivalent leaf labeled trees (see [5]
for background on leaf labeled trees). Typically, the labels come from the set
$[n] = \{1, 2, \ldots, n\}$. For $K \subseteq X$ and $T \in \mathcal{T}_X$, let $T|_K \in \mathcal{T}_K$ denote the trivalent tree
obtained by restricting $T$ to the label set of leaves $K$, contracting vertices of degree two
as necessary to obtain a trivalent tree. Let $\mathcal{T}_{X,r}$ be the set of unordered lists of length $r$
of elements of $\mathcal{T}_X$. Note that these elements need not be distinct. For $S \in \mathcal{T}_{X,r}$ with
$S = (T_1, \ldots, T_r)$, let $S|_K = (T_1|_K, \ldots, T_r|_K)$.

Figure 1. The right-hand tree is $T|_{\{1,2,4,7\}}$ for the tree $T$ on the left.



Definition 1. Let $S_1, S_2 \in \mathcal{T}_{X,r}$ with $S_1 \ne S_2$. A subset $K \subseteq X$ is said to disentangle $S_1$
and $S_2$ if $S_1|_K \ne S_2|_K$. Let $d(S_1, S_2)$ be the cardinality of the minimum disentangling set
of $S_1$ and $S_2$. The disentangling number $D(r)$ is

$$D(r) = \max_{n \in \mathbb{N}} \ \max_{S_1 \ne S_2 \in \mathcal{T}_{[n],r}} d(S_1, S_2).$$

Humphries [2] proved that the disentangling number exists, and gave the bounds

$$3(\lfloor \log_2 r \rfloor + 1) \le D(r) \le 3r.$$

At present, the only exactly known values of the disentangling number are $D(1) = 4$ and
$D(2) = 6$ [4]. The first value, $D(1) = 4$, is usually stated as "the quartets
determine the tree" (see e.g. [3]).

The main motivation for studying the disentangling number is that it can be used
as a tool in proofs of the identifiability of the tree parameters in phylogenetic mixture
models. Indeed, if it can be shown, for some value $s \ge D(r)$, that the tree parameters
of $r$-class mixtures on $s$-leaf phylogenetic trees are (generically) identifiable under some
phylogenetic model, then the tree parameters are (generically) identifiable for $r$-class
mixtures on all trees with $t \ge s$ leaves. For example, the known value $D(2) = 6$ was used
in the proof of generic identifiability of the tree parameters of 2-tree Jukes-Cantor and
Kimura 2-parameter mixture models [1].

We provide the following improved upper bound on the disentangling number, which
is within one of the best possible value.

Theorem 2. $D(r) \le 3(\lfloor \log_2(r) \rfloor + 1) + 1$.
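
As a quick numerical illustration of how these bounds compare, the following Python sketch tabulates the lower bound of Humphries, the upper bound of Theorem 2, and the earlier upper bound $3r$. The function names are hypothetical and chosen only for this illustration.

```python
def floor_log2(r: int) -> int:
    """Return floor(log2(r)) for a positive integer r, using exact integer arithmetic."""
    if r < 1:
        raise ValueError("r must be a positive integer")
    return r.bit_length() - 1


def lower_bound(r: int) -> int:
    """Lower bound 3(floor(log2 r) + 1) on D(r)."""
    return 3 * (floor_log2(r) + 1)


def theorem2_upper_bound(r: int) -> int:
    """Upper bound 3(floor(log2 r) + 1) + 1 on D(r) from Theorem 2."""
    return lower_bound(r) + 1


if __name__ == "__main__":
    # Columns: r, lower bound, Theorem 2 upper bound, earlier upper bound 3r.
    for r in (1, 2, 3, 4, 8, 100):
        print(r, lower_bound(r), theorem2_upper_bound(r), 3 * r)
```

For $r = 1$ and $r = 2$ the two bounds read $3 \le D(1) \le 4$ and $6 \le D(2) \le 7$, consistent with the known values $D(1) = 4$ and $D(2) = 6$.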

To prove Theorem 2, we first reduce to rooted binary trees, as follows. Let $\mathcal{RT}_X$ denote
the set of leaf labeled rooted binary trees on leaf label set $X$. For $T \in \mathcal{RT}_X$ and $K \subseteq X$,
let $T|_K$ be the induced binary rooted tree on leaf label set $K$, with edges contracted as
appropriate to obtain a rooted binary tree. Let $\mathcal{RT}_{X,r}$ be the set of unordered lists of length
$r$ of elements of $\mathcal{RT}_X$. Note that these elements need not be distinct. For $S \in \mathcal{RT}_{X,r}$
with $S = (T_1, \ldots, T_r)$, let $S|_K = (T_1|_K, \ldots, T_r|_K)$. Define the rooted disentangling number
$RD(r)$ in an analogous way to the disentangling number.
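
The restriction operation and the search for a minimum disentangling set can be sketched in Python as follows. This is a minimal brute-force illustration, not an efficient algorithm. It assumes a representation that is not part of the text: a rooted binary tree is a nested `frozenset` whose leaves are plain labels, and an unordered list of trees is compared as a multiset via `Counter`. All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def make_tree(nested):
    """Convert nested 2-tuples such as ((1, 2), 3) into nested frozensets."""
    if isinstance(nested, tuple):
        return frozenset(make_tree(child) for child in nested)
    return nested


def restrict(tree, keep):
    """Induced rooted binary tree on the labels in `keep` (None if no label survives)."""
    if not isinstance(tree, frozenset):  # a leaf
        return tree if tree in keep else None
    kids = [r for r in (restrict(c, keep) for c in tree) if r is not None]
    if not kids:
        return None
    if len(kids) == 1:  # contract the vertex that has lost one of its children
        return kids[0]
    return frozenset(kids)


def restrict_list(trees, keep):
    """Restriction of an unordered list of trees, stored as a multiset."""
    return Counter(restrict(t, keep) for t in trees)


def min_disentangling_size(s1, s2, labels):
    """Smallest size of a label set on which the two lists differ (None if they are equal)."""
    for size in range(1, len(labels) + 1):
        for subset in combinations(labels, size):
            if restrict_list(s1, set(subset)) != restrict_list(s2, set(subset)):
                return size
    return None


if __name__ == "__main__":
    t1 = make_tree((((1, 2), 3), 4))
    t2 = make_tree((((1, 3), 2), 4))
    print(restrict(t1, {1, 3, 4}) == make_tree(((1, 3), 4)))
    print(min_disentangling_size([t1], [t2], [1, 2, 3, 4]))
```

In this example the two single-tree lists are first separated by the three labels $\{1, 2, 3\}$, in line with $RD(1) = 3$.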

Proposition 3. The disentangling and rooted disentangling numbers satisfy
$D(r) \le RD(r) + 1$.

Proof. Let $n \ge RD(r)$, and consider a set $X$ of cardinality $n + 1$, e.g. $X = \{0\} \cup [n]$.
Let $S_1, S_2 \in \mathcal{T}_{X,r}$ with $S_1 \ne S_2$. Choosing the leaf $0$ (or any other leaf) as a root node, we arrive
at lists $\tilde{S}_1 \ne \tilde{S}_2 \in \mathcal{RT}_{[n],r}$. By definition of the rooted disentangling number, there is a
$K \subseteq [n]$ of $RD(r)$ elements such that $\tilde{S}_1|_K \ne \tilde{S}_2|_K$. This implies that the set $\{0\} \cup K$ satisfies
$S_1|_{\{0\} \cup K} \ne S_2|_{\{0\} \cup K}$. □

Theorem 2 then follows as a corollary of the following result.

Theorem 4. $RD(r) = 3(\lfloor \log_2(r) \rfloor + 1)$.

To prove the inequality $RD(r) \le 3(\lfloor \log_2(r) \rfloor + 1)$ of Theorem 4, we use the known value
$RD(1) = 3$ (i.e. rooted triples determine a rooted tree (see e.g. [5])) to encode multisets of
trees as high dimensional contingency tables. To this end, consider the $3^{\binom{\#X}{3}}$-dimensional
space

$$Q_X := \mathbb{R}^{3^{\binom{\#X}{3}}} = \bigotimes_{\{i,j,k\} \in \binom{X}{3}} \mathbb{R}(e_{i|jk}, e_{j|ik}, e_{k|ij}).$$

Coordinates on this space are indexed by lists of $\binom{\#X}{3}$ rooted triplets, one for each
three-element subset of $X$. Each rooted binary tree $T$ with leaf label set $X$ gives rise to a
uniquely determined standard unit vector

$$e_T := \bigotimes_{\{i,j,k\} \in \binom{X}{3}} e_{T|_{\{i,j,k\}}} \in Q_X.$$

This uniqueness is a consequence of the fact that rooted triples in a rooted tree uniquely
determine the tree.
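
The index of the unit vector $e_T$ can be computed directly, and the uniqueness claim can be checked exhaustively for small label sets. The sketch below is an illustration under assumptions that are not part of the text: trees are nested `frozenset`s with plain labels as leaves, each rooted triple $x|yz$ is recorded by its separated label $x$, and all function names are hypothetical.

```python
from itertools import combinations


def restrict(tree, keep):
    """Induced rooted binary tree on the labels in `keep` (None if no label survives)."""
    if not isinstance(tree, frozenset):  # a leaf
        return tree if tree in keep else None
    kids = [r for r in (restrict(c, keep) for c in tree) if r is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else frozenset(kids)


def outgroup(tree, i, j, k):
    """Return x, where the rooted triple of `tree` on {i, j, k} is x|yz."""
    induced = restrict(tree, {i, j, k})
    return next(child for child in induced if not isinstance(child, frozenset))


def triple_index(tree, labels):
    """Index of the unit vector e_T: one rooted triple per 3-subset of the labels."""
    return tuple(outgroup(tree, *t) for t in combinations(sorted(labels), 3))


def all_rooted_binary_trees(labels):
    """Enumerate every rooted binary tree on the given labels exactly once."""
    labels = list(labels)
    if len(labels) == 1:
        return [labels[0]]
    first, rest = labels[0], labels[1:]
    trees = []
    # Split the labels at the root; `first` always goes to the left part,
    # so each unordered split is visited once.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            left = [first, *extra]
            right = [x for x in rest if x not in extra]
            for l in all_rooted_binary_trees(left):
                for r in all_rooted_binary_trees(right):
                    trees.append(frozenset({l, r}))
    return trees


if __name__ == "__main__":
    labels = [1, 2, 3, 4]
    trees = all_rooted_binary_trees(labels)
    indices = {triple_index(t, labels) for t in trees}
    print(len(trees), len(indices))
```

On four labels the enumeration produces 15 trees with 15 distinct indices, so distinct trees give distinct unit vectors in this small case.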

An unordered list of trees $S = (T_1, \ldots, T_r) \in \mathcal{RT}_{X,r}$ gives rise to a nonnegative integer
array

$$u_S = e_{T_1} + e_{T_2} + \cdots + e_{T_r}$$

in $Q_X$, whose 1-norm is equal to $r$. Furthermore, the list $S$ can be recovered from the
vector $u_S$. We will use this encoding of sets of trees to prove Theorem 4.
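
Because $u_S$ has at most $r$ nonzero entries, it can be stored sparsely. The following sketch is one possible encoding, again assuming nested `frozenset` trees and recording each rooted triple $x|yz$ by its separated label $x$. The names `encode`, `difference`, and `one_norm` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def make_tree(nested):
    """Convert nested 2-tuples such as ((1, 2), 3) into nested frozensets."""
    if isinstance(nested, tuple):
        return frozenset(make_tree(child) for child in nested)
    return nested


def restrict(tree, keep):
    """Induced rooted binary tree on the labels in `keep` (None if no label survives)."""
    if not isinstance(tree, frozenset):  # a leaf
        return tree if tree in keep else None
    kids = [r for r in (restrict(c, keep) for c in tree) if r is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else frozenset(kids)


def triple_index(tree, labels):
    """Index of e_T: for every 3-subset, the label x of the rooted triple x|yz."""
    index = []
    for subset in combinations(sorted(labels), 3):
        induced = restrict(tree, set(subset))
        index.append(next(c for c in induced if not isinstance(c, frozenset)))
    return tuple(index)


def encode(trees, labels):
    """Sparse form of u_S: index tuple -> number of trees in S with that index."""
    return Counter(triple_index(t, labels) for t in trees)


def difference(u1, u2):
    """Sparse form of u1 - u2, keeping only the nonzero entries."""
    keys = set(u1) | set(u2)
    return {k: u1.get(k, 0) - u2.get(k, 0) for k in keys if u1.get(k, 0) != u2.get(k, 0)}


def one_norm(table):
    """Sum of the absolute values of the entries of a sparse table."""
    return sum(abs(v) for v in table.values())


if __name__ == "__main__":
    labels = [1, 2, 3, 4]
    t1 = make_tree((((1, 2), 3), 4))
    t2 = make_tree(((1, 2), (3, 4)))
    u1 = encode([t1, t1, t2], labels)
    u2 = encode([t1, t2, t2], labels)
    print(one_norm(u1), one_norm(u2), one_norm(difference(u1, u2)))
```

Both encoded lists have 1-norm $r = 3$, and their difference has 1-norm 2, which respects the general bound $\|u_{S_1} - u_{S_2}\|_1 \le 2r$ used later in the proof.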

Let $K$ be a finite set. For each $k \in K$, let $d_k \in \mathbb{N}_{>1}$ and consider the space

$$\mathbb{R}^{d_K} = \bigotimes_{k \in K} \mathbb{R}^{d_k},$$

with the standard unit vectors $\otimes_{k \in K} e_{j_k}$. For each $L \subseteq K$, we get a linear map

$$\pi_L : \mathbb{R}^{d_K} \to \mathbb{R}^{d_L}, \qquad \otimes_{k \in K} e_{j_k} \mapsto \otimes_{k \in L} e_{j_k}.$$

Given $u \in \mathbb{R}^{d_K}$, $\pi_L(u)$ is called the $L$-marginal of $u$. If $\Gamma = \{L_1, \ldots, L_s\}$ is a set of subsets
of $K$, we get an induced linear map

$$\pi_\Gamma : \mathbb{R}^{d_K} \to \bigoplus_{i=1}^{s} \mathbb{R}^{d_{L_i}}, \qquad u \mapsto \pi_{L_1}(u) \oplus \cdots \oplus \pi_{L_s}(u),$$

which is the linear transformation that computes the $\Gamma$-marginals of $u$. Suppose now that
$\Gamma$ is closed downward, that is, $L \in \Gamma$ and $L' \subseteq L$ implies that $L' \in \Gamma$. Such a $\Gamma$ is called a
simplicial complex. The elements of $\Gamma$ are called the faces of $\Gamma$.
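
For a sparsely stored table, the $L$-marginal amounts to summing out the coordinates outside $L$. The sketch below assumes, for illustration only, that $K$ is identified with the positions $0, 1, \ldots, \#K - 1$ of an index tuple and that a table is a `dict` from index tuples to integers. The function names are hypothetical.

```python
def marginal(table, positions):
    """L-marginal of a sparse table: sum out every coordinate not in `positions`."""
    out = {}
    for index, value in table.items():
        key = tuple(index[p] for p in positions)
        out[key] = out.get(key, 0) + value
    # Drop entries that cancelled to zero so that equal tables compare equal.
    return {k: v for k, v in out.items() if v != 0}


def marginals(table, faces):
    """The map pi_Gamma: the list of L-marginals, one for each face L in `faces`."""
    return [marginal(table, face) for face in faces]


if __name__ == "__main__":
    # A 2 x 3 table with three nonzero entries.
    u = {(0, 0): 1, (0, 2): 2, (1, 1): 3}
    print(marginal(u, (0,)))   # row sums
    print(marginal(u, (1,)))   # column sums
    print(marginal(u, ()))     # grand total
```

Here the faces `(0,)`, `(1,)` and `()` give the row sums, the column sums, and the grand total of the table.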

Theorem 5. [3] Let $\Gamma$ be a simplicial complex, let $s$ be the cardinality of the smallest
$S \subseteq K$ not in $\Gamma$, and $u \in \ker_{\mathbb{Z}} \pi_\Gamma$ with $u \ne 0$. Then $\|u\|_1 \ge 2^s$.
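
As a small sanity check of Theorem 5 (an illustration added here, not an example taken from [3]), take $K = \{1, 2\}$ with $d_1 = d_2 = 2$ and $\Gamma = \{\emptyset, \{1\}, \{2\}\}$. The smallest subset of $K$ not in $\Gamma$ is $K$ itself, so $s = 2$, and the bound is attained by the following $2 \times 2$ table, whose row sums and column sums vanish:

```latex
u = e_1 \otimes e_1 - e_1 \otimes e_2 - e_2 \otimes e_1 + e_2 \otimes e_2
  = \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix},
\qquad
\pi_{\{1\}}(u) = \pi_{\{2\}}(u) = 0,
\qquad
\|u\|_1 = 4 = 2^{2}.
```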

Proof that $RD(r) \le 3(\lfloor \log_2(r) \rfloor + 1)$. Fix $r$, and suppose that $RD(r) > g(r) := 3(\lfloor \log_2(r) \rfloor + 1)$.
Then there exist two unordered lists of rooted binary trees $S_1 \ne S_2 \in \mathcal{RT}_{[n],r}$ for some $n > g(r)$
such that for all $k \le g(r)$ and $K \in \binom{[n]}{k}$, we have $S_1|_K = S_2|_K$.

Let $\Gamma_r$ be the simplicial complex with ground set $\binom{[n]}{3}$ such that a set $\{K_1, \ldots, K_m\}$
forms a face of $\Gamma_r$ if and only if

$$\#(K_1 \cup \cdots \cup K_m) \le g(r).$$

Note that this implies that the smallest $S \notin \Gamma_r$ has $\lfloor \log_2(r) \rfloor + 2$ elements
(obtained by taking that many disjoint triplets).

The fact that $S_1|_K = S_2|_K$ for all $K \in \binom{[n]}{k}$ with $k \le g(r)$ implies that $\pi_{\Gamma_r}(u_{S_1}) =
\pi_{\Gamma_r}(u_{S_2})$. Indeed, if $L = \{K_1, \ldots, K_m\}$, then $\pi_L(e_T)$ is a table with a single nonzero
entry which records which of the rooted triplets on $K_1, \ldots, K_m$ the tree $T$ has. Thus,
$\pi_L(u_{S_1})$ is a table which records which combinations of rooted triplets on $K_1, \ldots, K_m$
appear in the trees in $S_1$. If $\#(K_1 \cup \cdots \cup K_m) \le g(r)$, this information can be read off
from $S_1|_{K_1 \cup \cdots \cup K_m} = S_2|_{K_1 \cup \cdots \cup K_m}$, and must be the same for both $S_1$ and $S_2$.

Since $\pi_{\Gamma_r}(u_{S_1}) = \pi_{\Gamma_r}(u_{S_2})$, we see that $u_{S_1} - u_{S_2} \in \ker_{\mathbb{Z}} \pi_{\Gamma_r}$. Since $S_1 \ne S_2$, $v =
u_{S_1} - u_{S_2} \ne 0$. By Theorem 5 we have $\|v\|_1 \ge 2^{\lfloor \log_2(r) \rfloor + 2} > 2r$. On the other hand each
$u_{S_i}$ has 1-norm $r$, so $\|v\|_1 \le 2r$. This is a contradiction. □
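
The two norm estimates in the last step can be spelled out as follows. This is only an expansion of the inequalities already stated, using $\lfloor \log_2(r) \rfloor + 1 > \log_2(r)$ and the triangle inequality:

```latex
2^{\lfloor \log_2(r) \rfloor + 2}
  = 2 \cdot 2^{\lfloor \log_2(r) \rfloor + 1}
  > 2 \cdot 2^{\log_2(r)}
  = 2r,
\qquad
\|v\|_1 = \|u_{S_1} - u_{S_2}\|_1
  \le \|u_{S_1}\|_1 + \|u_{S_2}\|_1
  = 2r.
```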

Remark. Note that the same argument for the upper bound would work even if our trees
were not binary, by using vector spaces of dimension $4^{\binom{\#X}{3}}$, since arbitrary rooted trees
are determined by their rooted triplets (of which there are four possibilities).

The lower bound $RD(r) \ge 3(\lfloor \log_2(r) \rfloor + 1)$ can be deduced from an elegant construction
of Humphries ([2], stated for unrooted trees), which we repeat here for completeness.

Proof that $RD(r) \ge 3(\lfloor \log_2(r) \rfloor + 1)$. Suppose first that $r = 2^{k-1}$. Let $T$ be a fixed, but
otherwise arbitrary, rooted leaf labeled binary tree with $k$ leaves, labeled by $[k]$. We will
construct sets of trees on $3k$ leaves which prove the lower bound. Now let $X$ be the leaf
label set $X = \{a_1, b_1, c_1, \ldots, a_k, b_k, c_k\}$. On each triple we use two rooted trees $a_i|b_i c_i$
and $b_i|a_i c_i$, denoted $t_i^0$ and $t_i^1$ respectively. Given a list of trees $t^\epsilon = (t_1^{\epsilon_1}, \ldots, t_k^{\epsilon_k})$ with
$\epsilon \in \{0, 1\}^k$, form the tree $T^\epsilon$ with $3k$ leaves by identifying the root of the tree $t_i^{\epsilon_i}$ with the
leaf labeled $i$ of $T$.
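Now let

$$S_{k,\mathrm{odd}} = \Big\{ T^\epsilon : \epsilon \in \{0, 1\}^k, \ \sum_i \epsilon_i \equiv 1 \bmod 2 \Big\},$$

$$S_{k,\mathrm{even}} = \Big\{ T^\epsilon : \epsilon \in \{0, 1\}^k, \ \sum_i \epsilon_i \equiv 0 \bmod 2 \Big\}.$$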

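For any subset $K \subseteq X$ with $\#K = 3k - 1$, we have $S_{k,\mathrm{odd}}|_K = S_{k,\mathrm{even}}|_K$. Indeed,
this $K$ omits one leaf, say $a_k$, so that both triples $a_k|b_k c_k$ and $b_k|a_k c_k$ collapse to an
identical cherry on $b_k$ and $c_k$ in all trees. Thus, the trees in both $S_{k,\mathrm{odd}}|_K$ and $S_{k,\mathrm{even}}|_K$
are determined by all vectors $\epsilon \in \{0, 1\}^{k-1}$. Note that $\#S_{k,\mathrm{odd}} = \#S_{k,\mathrm{even}} = 2^{k-1}$. Since
restricting to a smaller label set can only lose information, no set of fewer than $3k$ labels
disentangles the distinct lists $S_{k,\mathrm{odd}}$ and $S_{k,\mathrm{even}}$. This
implies that $RD(r) \ge 3k = 3(\log_2(r) + 1)$.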

If $2^{k-1} < r < 2^k$, we let $T'$ be an arbitrary tree with $3k$ leaves on the label set $X$, and
let $S_1$ and $S_2$ be the multisets $S_1 = S_{k,\mathrm{odd}} \cup \{T', \ldots, T'\}$ and $S_2 = S_{k,\mathrm{even}} \cup \{T', \ldots, T'\}$,
where we union with $r - 2^{k-1}$ copies of $T'$. Then by the preceding argument $S_1|_K = S_2|_K$
for all subsets $K \subseteq X$ with $\#K = 3k - 1$. □
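
The construction can be checked by brute force for small $k$. The sketch below makes several choices that the proof leaves open or that are purely illustrative: the backbone tree $T$ is taken to be a caterpillar, trees are stored as nested `frozenset`s with string labels such as `"a1"`, and all function names are hypothetical.

```python
from collections import Counter
from itertools import combinations, product


def restrict(tree, keep):
    """Induced rooted binary tree on the labels in `keep` (None if no label survives)."""
    if not isinstance(tree, frozenset):  # a leaf
        return tree if tree in keep else None
    kids = [r for r in (restrict(c, keep) for c in tree) if r is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else frozenset(kids)


def humphries_lists(k):
    """Return (S_odd, S_even, labels) for a caterpillar backbone T on k leaves."""
    labels = [f"{x}{i}" for i in range(1, k + 1) for x in "abc"]

    def small(i, bit):
        a, b, c = f"a{i}", f"b{i}", f"c{i}"
        if bit == 0:
            return frozenset({a, frozenset({b, c})})  # t_i^0 = a_i | b_i c_i
        return frozenset({b, frozenset({a, c})})      # t_i^1 = b_i | a_i c_i

    odd, even = [], []
    for eps in product((0, 1), repeat=k):
        tree = small(1, eps[0])
        for i in range(2, k + 1):
            # Attach the next three-leaf tree along the caterpillar backbone.
            tree = frozenset({tree, small(i, eps[i - 1])})
        (odd if sum(eps) % 2 else even).append(tree)
    return odd, even, labels


def min_disentangling_size(s1, s2, labels):
    """Smallest size of a label set on which the two lists differ (None if they are equal)."""
    for size in range(1, len(labels) + 1):
        for subset in combinations(labels, size):
            keep = set(subset)
            left = Counter(restrict(t, keep) for t in s1)
            right = Counter(restrict(t, keep) for t in s2)
            if left != right:
                return size
    return None


if __name__ == "__main__":
    for k in (1, 2, 3):
        odd, even, labels = humphries_lists(k)
        print(k, len(odd), min_disentangling_size(odd, even, labels))
```

For $k = 1, 2, 3$ the search returns $3k$, so in these cases only the full label set disentangles the two lists. Appending the same extra tree $T'$ to both lists does not change this, as the padding step of the proof asserts.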